Discovering data quality rules
Authors
Abstract
Dirty data is a serious problem for businesses, leading to incorrect decision making, inefficient daily operations, and ultimately wasted time and money. Dirty data often arises when domain constraints and business rules, meant to preserve data consistency and accuracy, are enforced incompletely or not at all in application code. In this work, we propose a new data-driven tool that can be used within an organization’s data quality management process to suggest possible rules, and to identify conformant and non-conformant records. Data quality rules are known to be contextual, so we focus on the discovery of context-dependent rules. Specifically, we search for conditional functional dependencies (CFDs), that is, functional dependencies that hold only over a portion of the data. The output of our tool is a set of functional dependencies together with the context in which they hold (for example, a rule stating that, for CS graduate courses, the course number and term functionally determine the room and instructor). Since the input to our tool will likely be a dirty database, we also search for CFDs that almost hold. We return these rules together with the non-conformant records (as these are potentially dirty records). We present effective algorithms for discovering CFDs and dirty values in a data instance. Our discovery algorithm searches for minimal CFDs among the data values and prunes redundant candidates. No universal objective measures of data quality or data quality rules are known. Hence, to avoid returning an unnecessarily large number of CFDs and to return only those that are most interesting, we evaluate a set of interest metrics and present comparative results using real datasets. We also present an experimental study showing the scalability of our techniques.
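To make the notion of a CFD that "almost holds" concrete, the following is a minimal sketch of checking one candidate rule against a small table. It is not the discovery algorithm from the paper. The function name `check_cfd`, the column names, the sample rows, and the choice to treat the most common right-hand-side value in each group as the expected one are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def check_cfd(records, condition, lhs, rhs):
    """Check the rule `lhs -> rhs` on the records that satisfy `condition`.

    Returns (conformant, non_conformant, support), where support is the
    fraction of in-context records that agree with the rule.
    """
    # Keep only the records in the context where the rule is meant to hold.
    in_context = [
        r for r in records
        if all(r.get(attr) == value for attr, value in condition.items())
    ]

    # Group the in-context records by their left-hand-side values.
    groups = defaultdict(list)
    for r in in_context:
        groups[tuple(r[a] for a in lhs)].append(r)

    conformant, non_conformant = [], []
    for group in groups.values():
        # Assumption: the most common right-hand-side value in a group is
        # taken as the expected one; records that differ are flagged.
        counts = Counter(tuple(r[a] for a in rhs) for r in group)
        expected = counts.most_common(1)[0][0]
        for r in group:
            if tuple(r[a] for a in rhs) == expected:
                conformant.append(r)
            else:
                non_conformant.append(r)

    support = len(conformant) / len(in_context) if in_context else 0.0
    return conformant, non_conformant, support


# Hypothetical course table; the third row disagrees on the room.
records = [
    {"dept": "CS", "level": "grad", "course": "2410", "term": "F08", "room": "R101", "instructor": "Lee"},
    {"dept": "CS", "level": "grad", "course": "2410", "term": "F08", "room": "R101", "instructor": "Lee"},
    {"dept": "CS", "level": "grad", "course": "2410", "term": "F08", "room": "R205", "instructor": "Lee"},
    {"dept": "CS", "level": "grad", "course": "2520", "term": "F08", "room": "R110", "instructor": "Kim"},
    {"dept": "MATH", "level": "grad", "course": "2410", "term": "F08", "room": "R300", "instructor": "Wu"},
]

conformant, non_conformant, support = check_cfd(
    records,
    condition={"dept": "CS", "level": "grad"},
    lhs=["course", "term"],
    rhs=["room", "instructor"],
)
print(support)          # 0.75
print(non_conformant)   # the single record with room "R205"
```

Here the rule holds for three of the four CS graduate records, so a tool of this kind could report it as an approximate rule and return the remaining record as potentially dirty. The MATH row falls outside the context and is ignored.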
Similar resources
A Distributed-Population Genetic Algorithm for Discovering Interesting Prediction Rules
In data mining, the quality of prediction rules basically involves three criteria: accuracy, comprehensibility and interestingness. The majority of the rule induction literature focuses on discovering accurate, comprehensible rules. In this paper we also take these two criteria into account, but we go beyond them in the sense that we aim at discovering rules that are interesting (surprising) for th...
Discovering Data Quality Rules in a Master Data Management
Dirty data continues to be an important issue for companies. The Data Warehousing Institute [Eckerson, 2002], [Rockwell, 2012] stated that poor data costs US businesses $611 billion annually and that erroneously priced data in retail databases costs US customers $2.5 billion each year. Data quality is becoming more and more critical. The database community pays particular attention to this subject whe...
Discovering Non-Redundant Association Rules using MinMax Approximation Rules
Dept. of Comp. Sci. & Eng., Vaagdevi College of Eng., Warangal, India. Frequent pattern mining is an important area of data mining used to generate association rules. The quality of the extracted frequent patterns is a big concern, as the process generates huge sets of rules and many of them are redundant. Mining Non-Redundant Frequent patterns is a big concern in the area of Ass...
Automating Objective Data Quality Assessment (experiences
The paper discusses the design goals, current architecture and overall experiences gained in the process of building a software tool aiding human analysts in estimating approximate information quality (information accuracy) of an unknown relational data source. We discuss the algorithms and techniques that we found effective. In particular, we discuss the automated reasoning techniques used to ...
Cost of Low-Quality Data over Association Rules Discovery
Quality in data mining critically depends on the preparation and on the quality of the processed data sets. Indeed, data mining processes and applications require various forms of data preparation (and repair) with several data formatting and cleaning techniques, because the data input to the mining algorithms is assumed to conform to nice data distributions, containing no missing, inconsistent or i...
Journal title:
- PVLDB
Volume 1, Issue
Pages -
Publication date 2008